(57) ABSTRACT

A system and method for facilitating both in-order and
out-of-order packet reception in a SAN includes requester
and responder nodes that maintain local copies of a message 
sequence number. Each request packet includes an ordering 
field specifying whether the packets must be received 
in-order. The request node includes a copy of the local 
sequence number in each packet transmitted and increments 
its local copy of the sequence number only for packets that 
must be received in order. The responder node includes the 
received message sequence number in all response packets 
and increments its local copy of the message sequence 
number only if the ordering field specifies that the packets 
must be received in order. 

5 Claims 3 Drawing Sheets 



[Representative drawing: Sequence Number Example, showing descriptors in the work queue with the requestor and responder sequence numbers]



11/03/2003, EAST Version: 1.4.1 



U.S. Patent    Dec. 10, 2002    Sheet 1 of 3    US 6,493,343 B1



FIG. 1: ServerNet II protocol layers
[Figure: End Node VIA NIC stack: VI Session Layer above two ports, each with Transaction, Packet, Link, MAC, and Physical layers. Routing Node stack: Routing Layer above multiple ports, each with Packet, Link, MAC, and Physical layers.]



FIG. 2: ServerNet II System Area Network
[Figure: ServerNet fabric of Router Node(s) connecting SNet NICs. Typical End Nodes: CPU, I/O Controller. Attached interfaces labeled PCI, Fiber Channel Arbitrated Loop (FC-AL), Gigabit Ethernet, ATM, and PC, Workstation.]










FIG. 3: Fault Tolerant ServerNet II System Area Network
[Figure: two configurations, Single Ported NICs and Dual Ported NICs. SNet NICs attach to ServerNet Router Node(s). Annotation: "Pair of controllers, each with two links into one fabric." Typical End Nodes: CPU, I/O Controller. Attached interfaces labeled PCI, FC-AL, Gigabit Ethernet, ATM, and PC, Workstation.]





FIG. 4: Four Logical Paths between SNet II Endnodes
[Figure: SNet II Endnode A and SNet II Endnode B, each with two port IDs (SNID 0 and SNID 1), joined by four logical paths.]






FIG. 5 



FIG. 6: Sequence Number Example
[Figure: Descriptors in the Work Queue, in order: Send (one packet); Send (two packets); Send (three packets); RDMA Op (2500 bytes, sent unordered); Send (one packet); Send (one packet).
Requestor column: Rqst. SN = 0, 1, 2, 3, 4, 5; then 6, 6, 6, 6, 6; then 6, 7, 8.
Responder column: Rsp. SN = 0, 1, 2, 3, 4, 5; then 6, 6, 6, 6, 6; then 6, 7, 8.]




SYSTEM AND METHOD FOR 
IMPLEMENTING MULTI-PATHING DATA 
TRANSFERS IN A SYSTEM AREA 
NETWORK 

CROSS-REFERENCES TO EARLIER 
APPLICATION 

This application claims priority from Provisional Appli- 
cation No. 60/070,650, filed Jan. 7, 1998, the disclosure of 
which is hereby incorporated by reference. 

BACKGROUND OF THE INVENTION 

Traditional network systems utilize either a channel semantics
(send/receive) or a memory semantics (DMA) model.
Channel semantics tend to be used in I/O environments and 
memory semantics in processor environments. 

In the channel semantics model, the sender does not know
where data is to be stored; it just puts the data on the channel.
On the sending side, the sending process specifies the 
memory regions that contain the data to be sent. On the 
receiving side, the receiving process specifies the memory 
regions where the data will be stored. 

In the memory semantics model, the sender directs data to 
a particular location in memory utilizing remote direct 
memory access (RDMA) transactions. The initiator of the 
data transfer specifies both the source buffer and destination 
buffer of the data transfer. There are two types of RDMA
operations, read and write. 

The virtual interface architecture (VIA) has been jointly 
developed by a number of computer and software compa- 
nies. VIA provides consumer processes with a protected, 
directly accessible interface to network hardware, termed a 
virtual interface. VIA is especially designed to provide low
latency message communication over a system area network 
(SAN) to facilitate multi-processing utilizing clusters of 
processors. 

A SAN is used to interconnect nodes within a distributed 
computer system, such as a cluster. The SAN is a type of 
network that provides high bandwidth, low latency commu- 
nication with a very low error rate. SANs often utilize a 
fault-tolerant network to provide continued message communications
in the event of failure.

It is important for the SAN to provide high reliability and 
high-bandwidth, low latency communication to fulfill the 
goals of the VIA. 

SUMMARY OF THE INVENTION 

According to one aspect of the present invention, a SAN 
maintains local copies of a sequence number for each data 
transfer transaction at the requestor and responder nodes. 
Each data transfer is implemented by the SAN as a sequence 
of request/response packet pairs. An error condition arises if 
a response to any request packet is not received at the
requesting node. Each request packet includes an ordering 
field which specifies whether or not the packets must be 
received at the responder in the order that they were sent. At 
the requestor and responder nodes, the local copy of the 
sequence number is incremented only if the ordering field in 
the packet sent or received, respectively, specifies that the 
packets must be received in the order sent. 

According to another aspect of the invention, a sliding 
window protocol is utilized that allows a requestor to 
continue to send a specified number of request packets 
before receiving the matching response packets. 

According to another aspect of the invention, RDMA 
transactions may be implemented utilizing multiple paths to 
increase bandwidth. 
Other features and advantages of the invention will be
apparent in view of the following detailed description and 
appended drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram depicting ServerNet protocol 
layers implemented by hardware, where ServerNet is a SAN 
manufactured by the assignee of the present invention; 

FIGS. 2 and 3 are block diagrams depicting SAN topolo- 
gies; 

FIG. 4 is a schematic diagram depicting logical paths 
between end nodes of a SAN; 

FIG. 5 is a schematic diagram depicting routers and links
connecting SAN end nodes; and

FIG. 6 is a graph depicting the transmission of request and 
response packets between a requestor and a responder end 
node. FIG. 6 shows the sequence numbers used in packets
for three Send operations, an RDMA operation, and two
additional Send operations. The diagram shows the
sequence numbers maintained in the requestor logic, the
sequence number contained in each packet, and the sequence 
numbers maintained at the responder logic. 

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

The preferred embodiments will be described as implemented
in the ServerNet II (ServerNet) architecture, manufactured
by the assignee of the present invention. ServerNet II is a
layered transport protocol for a System Area Network
(SAN) optimized to support the Virtual Interface (VI)
architecture session layer, which has stringent user-space to
user-space latency and bandwidth requirements. These
requirements mandate a reliable hardware (HW) message
transport solution with minimal software (SW) protocol
stack overhead. The ServerNet II protocol layers for an end
node VI Network Interface Controller/Card (NIC) and for a
routing node are illustrated in FIG. 1. A single NIC and VI
session layer may support one or two ports, each with its
associated transaction, packet, link-level, MAC (media
access) and physical layer. Similarly, routing nodes with a
common routing layer may support multiple ports, each with
its associated link-level, MAC and physical layer.

Support for two ports enables ServerNet II SAN to be
configured in both non-redundant and redundant (fault
tolerant, or FT) SAN configurations as illustrated in FIG. 2
and FIG. 3. On a fault tolerant network, a port of each end 
node may be connected to each network to provide contin- 
ued VI message communication in the event of failure of one 
of the SANs. In the fault tolerant SAN, nodes may also be 
ported into a single fabric, or single ported end nodes may 
be grouped into pairs to provide duplex FT controllers. The 
fabric is the collection of routers, switches, connectors, and 
cables that connects the nodes in a network. 

The following describes general ServerNet II terminology 
and concepts. The use of the term "layer" in the following 
description is intended to describe functionality and does not 
imply gate level partitioning. 
Two ports are supported on a NIC for both performance
and fault tolerance reasons. Both of these ports operate
under the same session layer VIA engine. That is, data may
arrive on any port and be destined for any VI. Similarly, the
VIs on the end node can generate data for any of these ports.

ServerNet II packets are comprised of a series of data
symbols followed by a packet framing command. Other
commands, used for flow control, virtual channel support,
and other link management functions, may be embedded
within a packet. Each request or response packet defines a
variety of information for routing, transaction type,
verification, length and VI specific information.

i. Routing in the ServerNet II SAN is destination based using
the first 3 bytes of the packet. Each NIC end node port in
the network is uniquely defined by a 20 bit Port SNID
(ServerNet Node ID). The first 3 bytes of a packet contain
the Destination port's SNID or DID (destination port ID)
field, a three bit Adaptive Control Bits (ACB) field and the
fabric ID bit. The ACB is used to specify the path
(deterministic or link-set adaptive) used to route the
packet to its destination port as described in the following
section.
ii. The transaction type fields define the type of session layer
operation that this ServerNet II packet is carrying and
other information such as whether it is a request or a
response and, if a response, whether it is an Ack
(acknowledgment) or a Nack (negative
acknowledgment). The ServerNet II SAN also supports
other transaction types.
iii. Transaction verification fields include the source port ID
(SID) and a Transaction Serial Number. The transaction
serial number enables a port with multiple requests
outstanding to uniquely match responses to requests.
iv. The Length field consists of an encoding of the number
of bytes of payload data in the packet. Payloads up to 512
bytes are supported and code space is reserved for future
increases in payload size.
v. The VI Session Layer specific fields describe VI
information such as the VI Operation, the VIA Sequence
number, and the Virtual Interface ID number. The VI
Operation field defines the type of VI transaction being
sent (Send, RDMA Read, RDMA Write) and other control
information such as whether the packet is ordered or
unordered, whether there is immediate data and/or
whether this is the first or last packet in a session layer
multi-packet transfer. Based on the VI transaction type
and control information, a 32 bit Immediate data field or
a 64 bit Virtual address may follow the VI ID number.
vi. The payload data field carries up to 512 bytes of data
between requestors and responders and may contain a pad
byte.
vii. The CRC field contains a checksum computed over the
entire packet.

Transaction Overview

The basic flow of transactions through the ServerNet II
SAN will now be described. VI requires the support of Send,
RDMA read and RDMA write transactions. These are
translated by the VI session layer into a set of ServerNet II
transactions (request/response packet pairs). All data
transfers (e.g., reading a disk file to CPU memory, dumping large
volumes of data from a disk farm directly over a high-speed
communications link, one end node simply interrupting
another) consist of one or more such transactions.

Creating a Request Packet

The VI User Agent provides the low level routines for
VIA Send, RDMA Write, and RDMA Read operations.
These routines place a descriptor for the desired transfer in
the appropriate VI queue and notify the VIA hardware that
the descriptor is ready for processing. The VIA hardware
reads the descriptor, and based on the descriptor contents,
builds the ServerNet request packet header and assembles
the data payload (if appropriate).

Dual Ports and Ordering

With two ports, it is possible for a single VIA
interface to process Sends and RDMA operations from
several different VIs in parallel. It is also possible for a large
RDMA request to be transferred on both of
the ports simultaneously. This latter feature is called
Multi-pathing.

ServerNet II end nodes can connect both their ports to a
single network fabric so that there are up to four possible
paths between ServerNet II end nodes. Each port of a single
end node may have a unique ServerNet ID (SNID). FIG. 4
depicts the four possible paths that End node A can use when
sending a request to End node B:

1) End node A SNID[0] to End node B SNID[0]
2) End node A SNID[0] to End node B SNID[1]
3) End node A SNID[1] to End node B SNID[0]
4) End node A SNID[1] to End node B SNID[1]

FIG. 5 depicts [illegible] nodes A-F, each having first and second
ports [illegible] ServerNet topology including routers R1-R5.
Links are represented by [illegible] set (fat pipe) 4 couples
routers [illegible] deterministic or link-set adaptive. An
[illegible] bandwidth. The Adaptive Control Bits (ACB) specify
which type of routing [illegible].

Deterministic routing preserves strict ordering for packets
from a particular source port to a destination port. In
deterministic routing the ACB field selects a single path or
lane through an adaptive link-set. Send transactions for a
particular VI require strict ordering and therefore use
deterministic routing.

RDMA transactions, on the other hand, may make use of
all possible paths in the network without regard for the
ordering of packets within a transaction. These transactions
may use adaptive routing as described below.
The ACB field specifies which specific link (or lane) in this
link-set is to be used for deterministic routing.
Alternatively, the ACB field can specify link-set adaptivity,
which enables the packets to dynamically choose from
any of the links in the link-set.

A sample topology with several different examples of
multipathing using link and path adaptivity is shown in FIG.
5.

Multipathing allows large block transfers done with
RDMA Read or Write operations to simultaneously use both
ports as well as adaptive links between the two communicating
NICs. Since the data transfer characteristics of any
one VI are expected to be bursty, multipathing allows the
end node to marshal all its resources for a single transfer.
Note that multipathing does not increase the throughput of
multiple Send operations from one VI. Sends from one VI
must be sent strictly ordered. Since there are no ordering
guarantees between packets originating from different ports,
only one port may be used per Send. Furthermore,
only a single ordered path through the network may be used,
as described in the following.

Transaction and Packet Layers

The transaction layer builds the ServerNet II request
packet by filling in the appropriate SID, Transaction Serial
Number (TSN), and CRC. The SID assigned to a packet
always corresponds to the SNID of the port that the packet
originates from. The TSN can be used to help the port
manage multiple outstanding requests and match the resulting
responses uniquely to the appropriate request. The CRC
enables the data integrity of the packet to be checked at the
end node and by routers en route.

Following the ServerNet II link protocol, the packet is
encoded in a series of data symbols followed by a status
command. The ServerNet II link layer uses other commands
for flow control and link management. These commands
may be inserted anywhere in the link data stream, including
between consecutive data symbols of a packet. Finally, the
symbols are passed through the MAC layer for transmission
on the physical media to an intermediate routing node.

Routing

The routing control function is programmable so that the
packet routing can be changed as needed when the network
configuration changes (e.g., route to new end nodes). Router
nodes serve as crossbar switches; a packet on any incoming
(receive) side of a link can be switched to the outgoing
(transmit) side of any link. As the incoming request packet
arrives at a router node, the first three bytes, containing the
DID and ACB fields, are decoded and used to select a link
leading to the destination node. If the transmit side of the
selected link is not busy, the head of the packet is sent to the
destination node whether or not the tail of the packet has
arrived at the routing node. If the selected link is busy with
another packet, the newly arrived packet must wait for the
target port to become free before it can pass through the
crossbar.

As the tail of the packet arrives, the router node checks the
packet CRC and updates the packet status (good or bad). The
packet status is carried by a link symbol TPG (this packet
good) or TPB (this packet bad) appended at the end of the
packet. Since packet status is checked on each link, a packet
status transition (good to bad) can be attributed to a specific
link. The packet routing process described above is repeated
for each router node in the selected path to the destination
node.

Receiving a Request Packet

When the request packet arrives at the destination node,
the ServerNet II interface receiver checks its validity (e.g.,
it must contain the correct destination node ID, the length is
correct, the Fabric bit in the packet matches the Fabric bit
associated with the receiving port, the request field encodes
a valid request, and the CRC must be good). If the packet is
invalid for any reason, the packet is discarded. The ServerNet
II interface may save error status for evaluation by
software. If these validity checks succeed, several more
checks are made. Specifically, if the request specifies an
RDMA Read or Write, the address is checked to ensure
access has been enabled for that particular VI. Also, the
input port and Source ID of the packet are checked to ensure
access to the particular VI is allowed on that input port from
the particular Source. If the packet is valid, the request can
be completed.

Response Packet

A response is created based on the success (ACK
response) or failure (NACK response) of the request packet.
A successful read request, for example, would include the
read data in the ACK response. The source node ID from the
request packet is used as the destination node ID for the
response packet. The response packet must be returned to
the original source port. The path taken by the response is
not necessarily the reverse of the path taken by the request.
The network may be configured so that responses take very
different paths than requests. If strict ordering is not
required, the response, like the request, may use link-set
adaptivity. The response packet is routed back to the SNID
specified by the SID field of the request. The ACB field of
the request packet is also duplicated for the response packet.

The response can be matched with the request using the
TSN and the packet validity checks. If an ACK response
passes these tests, the transaction layer passes the response
data to the session layer, frees resources associated with the
request, and reports the transaction as complete. If a NACK
response passes these tests, the end node reports the failure
of the transaction to the session layer. If a valid ACK/NACK
response is not received within the allotted time limit, a
time-out error is reported.

The requestor can stream many strictly ordered ServerNet
II messages onto the wire before receiving an acknowledgment.
The sliding window protocol allows the requestor to
have up to 128 packets outstanding per VI.

The hardware can operate in one of two modes with
respect to generating multiple outstanding request packets:

1. The hardware can stream packets from the same VI
send queue onto the wire, and start the next descriptor before
receiving all the acknowledgments from the current descriptor.
This is referred to as "Next Descriptor After Launch" or
NDAL.

2. The hardware can stream packets from a single descriptor
onto the wire but wait for all the outstanding acknowledgments
to complete before starting the next descriptor. This is
referred to as "Next Descriptor After Ack" or NDAA.

The choice of NDAL or NDAA modes of operation is
determined by how strongly ordered the packets are
generated.

Ordered and unordered messages may be mixed on a
single VI. When generating an unordered message, the
requestor must wait for completion of all acknowledgments
to unordered packets before starting the next descriptor.

Ordering of Send Packets Presented to Transaction Layer

The VI architecture has no explicit ordering rules as to
how the packets that make up a single descriptor are ordered
among themselves. That is, VIA only guarantees the message
ordering the client will see. For example, VIA requires
that Send descriptors for a particular VI be completed in
order, but the VIA specification doesn't say how the packets
will proceed on the wire.

The ServerNet II SAN requires that all Send packets
destined for a particular VI be delivered by the SAN in strict
order. As long as deterministic routing is used, the network
assures strict ordering along a path from a particular source
node to a particular destination node. This is necessary
because the receiving node places the incoming packets into
a scatter list. Each incoming packet goes to a destination
determined by the sum total of bytes of the previous packets.
The strict ordering of packets is necessary to preserve
integrity of the entire block of data being transferred because
incoming packets are placed in consecutive locations within
the block of data. Each packet has a sequence number to
allow the receiver to detect an out of order, missing, or
repeated packet.

There are two ways for an end node to meet these ordering
requirements:
a. The end node can wait for the acknowledgment from
each Send packet to complete before starting another Send
packet for that VI. By waiting for each acknowledgment the
end node doesn't have to worry about the network providing
strict ordering and can choose an arbitrary source port, 
adaptive link set, and destination port for each message. 

b. The end node can restrict all the Send operations for a 
given VI to use the same source port, the same destination 
port, and a single adaptive path. By choosing only one path 
through the network, the end node is guaranteed that each 
Send packet it launches into the network will arrive at the 
destination in order. 

The second approach requires the VIA end node to 
maintain state per VI that indicates which source port,
destination port, and adaptive path is currently in use for that
particular VI. Furthermore, the second approach allows the 
hardware to process descriptors in the higher performance 
NDAL mode. 
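
The per-VI state required by the second approach can be sketched as follows. This is a minimal illustration in Python, not the ServerNet II hardware design: the names SendPath, ViSendState, and path_for_send are hypothetical, and the rule that the first path chosen stays pinned for all later Sends is an assumption made to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SendPath:
    """One deterministic route (hypothetical field names)."""
    source_port: int  # local port, 0 or 1
    dest_port: int    # remote port, 0 or 1
    lane: int         # lane selected by the ACB within an adaptive link set


class ViSendState:
    """Remembers which path the Sends of one VI are pinned to."""

    def __init__(self) -> None:
        self.path: Optional[SendPath] = None

    def path_for_send(self, preferred: SendPath) -> SendPath:
        # Assumption: the first path requested is kept for all later Sends,
        # so every Send packet of this VI follows the same ordered route.
        if self.path is None:
            self.path = preferred
        return self.path


vi = ViSendState()
first = vi.path_for_send(SendPath(source_port=0, dest_port=1, lane=2))
second = vi.path_for_send(SendPath(source_port=1, dest_port=0, lane=0))
```

Here the second call returns the path recorded by the first call, which is what lets Send packets stream without waiting for acknowledgments.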

With the second approach, Send packets from a single VI
can stream onto the network without waiting for their 
accompanying acknowledgments. An incrementing 
sequence number is used so the destination node can detect 
missing, repeated, or unordered Send packets. 

Ordering of RDMA Packets 

RDMA operations have slightly different ordering
requirements than Send operations. An RDMA packet con- 
tains the address to which the destination end node writes the 
packet contents. This allows multiple RDMA packets within 
an RDMA message to complete out of order. The contents of 
each packet are written to the correct place in the end node's 
memory, regardless of the order in which they complete. 

RDMA request packets may be sent ordered or unordered. 
A bit in the packet header is set to a 1 for ordered packets 
and is set to a 0 for unordered packets. As will be explained 
later, this bit is used by the responder logic to determine if 
it should increment its copy of the expected sequence 
number. Sequence numbers do not increment for unordered 
packets. The end node is free to use different source ports, 
destination ports and adaptive paths for the packets. This 
freedom can be exploited for a performance gain through 
multipathing; simultaneously sending the RDMA packets of 
a single message across multiple paths. 
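
The reason unordered RDMA packets are safe can be illustrated with a short Python sketch. It assumes the 512 byte payload limit stated for the payload data field and uses a 2500 byte transfer as in the FIG. 6 example; the function names and the flat bytearray standing in for end node memory are hypothetical.

```python
import random

MAX_PAYLOAD = 512  # bytes per packet, per the payload limit in this description


def split_rdma_write(base_addr, data):
    """Split one RDMA Write into (address, payload) packets."""
    return [
        (base_addr + offset, data[offset:offset + MAX_PAYLOAD])
        for offset in range(0, len(data), MAX_PAYLOAD)
    ]


def apply_packets(memory, packets):
    """Write each packet at the address it carries, in arrival order."""
    for addr, payload in packets:
        memory[addr:addr + len(payload)] = payload


data = bytes(i % 251 for i in range(2500))
packets = split_rdma_write(100, data)

# Simulate multipathing by delivering the packets in a scrambled order.
arrival = packets[:]
random.Random(0).shuffle(arrival)

memory = bytearray(4096)
apply_packets(memory, arrival)
```

Because every packet carries its own destination address, the final memory image does not depend on the arrival order. A Send packet has no such address, which is why Sends need strict ordering.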

When RDMA Read or Write packets are sent over a path 
that does not exhibit strict ordering with the Send packets 
from the same VI, care must be taken when launching 
packets for the following message. The next message cannot 
be started until the last acknowledgment of the RDMA Read 
or Write operation successfully completes. 

In other words, when multipathing is used to generate 
RDMA Read or Write requests, the hardware must operate 
in the NDAA mode. This ensures the RDMA Read or Write
is completed before moving on to subsequent descriptors. 

An end node may choose to send RDMA packets strictly 
ordered. This can be advantageous for smaller RDMA 
transfers as the hardware can operate in NDAL mode. The 
VI can proceed to the next descriptor immediately after 
launching the last packet of a message that is sent strictly 
ordered (and hence used incrementing sequence numbers). 
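
The relationship between ordering and descriptor processing mode described above reduces to a one-line rule. The sketch below is illustrative only; DescriptorMode and mode_for_descriptor are hypothetical names, and the rule is taken from the surrounding text rather than from any hardware interface.

```python
from enum import Enum


class DescriptorMode(Enum):
    NDAL = "Next Descriptor After Launch"
    NDAA = "Next Descriptor After Ack"


def mode_for_descriptor(sent_ordered: bool) -> DescriptorMode:
    # Strictly ordered packets carry incrementing sequence numbers, so the
    # next descriptor may start right after launch. Unordered (multipathed)
    # packets require waiting for every acknowledgment first.
    return DescriptorMode.NDAL if sent_ordered else DescriptorMode.NDAA


# An ordered Send, an unordered RDMA, then an ordered RDMA.
modes = [mode_for_descriptor(o).name for o in (True, False, True)]
```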

Ordering of Generated Response Packets at the 
Responder 

The ServerNet II end node must respond to incoming 
Send requests and RDMA Write requests from a particular 
VI in strict order, and must write these packets to memory 
in strict order. 
The ServerNet II end node must also respond to incoming
RDMA Read requests from a particular VI in strict order. 

Because response packets are transported by the network 
in strict order, the requestor will receive all incoming 
response packets for a particular VI in the same order as that
in which the corresponding requests were generated. 

VIA Message Sequence Numbers 

The ServerNet SAN uses acknowledgment packets to 
inform the requestor that a packet completed successfully.
Sequence numbers in the packets (and acknowledgments) 
are used to allow the sender to support multiple outstanding 
requests to ensure adequate performance and to be able to 
recover from errors occurring in the network. 

FIG. 6 is a graph depicting the generation, checking, and 
updating of VIA sequence numbers at requestor and 
responder nodes. In FIG. 6 time increases in the downward 
direction. Requests are indicated by solid arrows directed to 
the right and responses by dotted arrows directed to the left.

Sequence Number Initialization 

The requestor and responder logic each maintain an 8 bit 
sequence number for each VI in use. When the VI is created, 
the requestor on one node and the responder on the remote
node initialize their sequence numbers to a common value, 
zero in the preferred embodiment. 

After this, the requestor places its sequence number into 
each of the outgoing request packets. As depicted in FIG. 6, 
the sequence number, SEQ, is included in each request 
packet. The responder compares the sequence number from 
the incoming request packet with the responder's local copy.
The responder uses this comparison to determine if the
packet is valid, if it is a duplicate of a packet already
received, or if it is an out-of-sequence packet. An
out-of-sequence packet can only happen if the responder missed an
incoming packet. The responder can choose to return a
'sequence error NACK packet' or it can simply ignore the
out-of-sequence packet. In the latter case, the requestor will
have a timeout on the request (and presumably on the packet
the responder missed) and initiate error recovery. Generating
a Sequence Error NACK Packet is preferred as it forces the
requestor to start error recovery more quickly.

The following describes how the sequence numbers are 
generated and checked. 

Generating Sequence Numbers for Request Packets. 

When transmitting ordered packets (i.e. transfers are on a
specific source port to a specific destination port and the
ACB specifies a specific lane) the request sequence number
is incremented after each packet is sent. When transmitting
unordered packets (i.e. multipathing is used and/or the ACB
bits specify full link set adaptivity) the request sequence
number is not incremented after such a packet is sent.

For example, in FIG. 6, during the first two Send
transactions, the local copy of the request sequence number
is incremented after the packet is sent (Rqst. SN=0 to 6). For
the RDMA operation, which sends 2500 bytes unordered,
the requestor does not increment its local copy of the request
sequence number (Rqst. SN=6). The requestor does not 
increment the local copy of the SN until after the first packet 
of the Send following the RDMA is transmitted. 
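
The requestor rule can be replayed against the work queue drawn in FIG. 6. The sketch below is a hedged illustration: the function name and the descriptor tuples are hypothetical, the 2500 byte RDMA is assumed to occupy five packets (2500 bytes at up to 512 bytes each), and wrapping the 8 bit counter modulo 256 is an assumption that the example never exercises.

```python
def request_sequence_numbers(descriptors, seq_bits=8):
    """Return the SEQ placed in each packet and the final local copy.

    descriptors: list of (name, packet_count, ordered) tuples.
    """
    modulus = 1 << seq_bits
    local_sn = 0  # both ends initialize to zero in the preferred embodiment
    sent = []
    for name, packet_count, ordered in descriptors:
        for _ in range(packet_count):
            sent.append((name, local_sn))  # copy local SN into the packet
            if ordered:
                # Only ordered packets advance the local copy.
                local_sn = (local_sn + 1) % modulus
    return sent, local_sn


# Work queue as drawn in FIG. 6.
FIG6_QUEUE = [
    ("Send", 1, True),
    ("Send", 2, True),
    ("Send", 3, True),
    ("RDMA", 5, False),  # 2500 bytes, sent unordered
    ("Send", 1, True),
    ("Send", 1, True),
]

sent, final_sn = request_sequence_numbers(FIG6_QUEUE)
seqs = [seq for _, seq in sent]
```

All five RDMA packets and the first packet of the following Send carry the same sequence number, 6, which matches the repeated Rqst. SN = 6 entries in the figure.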

Send packets are typically sent fully ordered lest the
requestor have to wait for an acknowledgment for each
packet before proceeding to the next. On the other hand,
RDMA packets may be sent either ordered or unordered. To
take advantage of multipathing, a requestor must use unordered
RDMA packets.

The sender guarantees to never exceed the windowsize
number of packets outstanding per VI. If S is the number of
bits in the sequence number, then the windowsize is 2**(S-1).

A packet is outstanding until it and all its predecessors are 
acknowledged. The requestor does not mark a descriptor 
done until all packets requested by that descriptor are 
positively acknowledged. 
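
The window arithmetic can be checked with a few lines of Python. With the 8 bit sequence number of the preferred embodiment the formula gives 128, which agrees with the limit of 128 outstanding packets per VI stated earlier. The SendWindow class is a hypothetical simplification: it assumes acknowledgments arrive in order, so a plain counter is enough to track outstanding packets.

```python
def window_size(seq_bits: int) -> int:
    """windowsize = 2**(S-1) for an S bit sequence number."""
    return 2 ** (seq_bits - 1)


class SendWindow:
    """Counts outstanding packets for one VI (in-order acks assumed)."""

    def __init__(self, seq_bits: int = 8) -> None:
        self.limit = window_size(seq_bits)
        self.outstanding = 0

    def can_send(self) -> bool:
        return self.outstanding < self.limit

    def launch(self) -> None:
        if not self.can_send():
            raise RuntimeError("window full: wait for an acknowledgment")
        self.outstanding += 1

    def acknowledge(self) -> None:
        if self.outstanding > 0:
            self.outstanding -= 1


window = SendWindow()
for _ in range(window.limit):
    window.launch()
full = not window.can_send()
window.acknowledge()
reopened = window.can_send()
```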

Checking Sequence Numbers on Incoming Request 
Packets. 

The destination node responding to the incoming request 
packet checks each incoming request packet to verify its 
sequence number against the responder's local copy.
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The responder logic compares its sequence number with p 
the packet's sequence number to determine if the incoming 
packet is either: 

the expected packet it is looking for (i.e., the packet's
sequence number is the same as the sequence number
maintained by the responder logic), in which case the
responder processes the packet and, if all other checks
are passed, the packet is Acknowledged and committed
to memory. If the transaction is ordered, then the
responder increments its sequence number. If the
transaction is unordered, then the responder does not
increment its sequence number;

an out-of-sequence packet (which means an earlier
incoming packet must have gotten lost), beyond the one
it is looking for, in which case the responder NACKs the
packet and throws it away. The receive logic in the VI
is not stopped and the responder does not increment its
sequence number; or

a duplicate packet (which is being resent because the
requester must not have received an earlier ack), in
which case the responder Acks the packet and throws it
away. If the request had been an RDMA Read, the
responder completes the read operation and returns the
data with a positive acknowledgment.
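
The three cases above can be sketched as a pure classification function. The text says only that the responder compares the two sequence numbers; the modular half-window test used here to tell an out-of-sequence packet from a duplicate is an assumed technique (it relies on the windowsize of 2**(S-1)), and the names `Verdict`, `classify`, and `next_local_sn` are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    EXPECTED = "ack and commit"
    OUT_OF_SEQUENCE = "nack and discard"
    DUPLICATE = "ack and discard"


def classify(packet_seq: int, local_sn: int, sn_bits: int) -> Verdict:
    """Compare a packet's SEQ with the responder's local copy (assumed test)."""
    modulus = 1 << sn_bits
    window = modulus >> 1                  # 2**(S-1)
    ahead = (packet_seq - local_sn) % modulus
    if ahead == 0:
        return Verdict.EXPECTED
    if ahead < window:
        return Verdict.OUT_OF_SEQUENCE     # beyond the packet being looked for
    return Verdict.DUPLICATE               # behind it: a resent packet


def next_local_sn(local_sn: int, verdict: Verdict, ordered: bool,
                  sn_bits: int) -> int:
    """Only an expected, ordered packet advances the responder's copy."""
    if verdict is Verdict.EXPECTED and ordered:
        return (local_sn + 1) % (1 << sn_bits)
    return local_sn
```

The sequence number is left unchanged for NACKed, duplicate, and unordered packets alike, so only the expected ordered case moves the responder forward.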

An example of the responder checking sequence numbers
for ordered and unordered packets is given in FIG. 6. In FIG.
6, during the first two Send transactions, the responder
checks that the SEQ in the packet matches the local copy of
Rsp. SN. Since the Send packets include an ACB indicating
ordered receipt, the Rsp. SN is incremented after each response
packet is transmitted. At the end of the first two Send
transactions, Rqst. SN and Rsp. SN are both equal to 6. The
packets for the RDMA include an ACB indicating unordered
receipt is allowed. Neither the requestor nor responder
increments its local copy of SN. Thus, at the end of the
RDMA transaction both Rqst. SN and Rsp. SN=6. The first
packet of the subsequent Send transaction has SEQ=6 and
SEQ matches the local copy of Rsp. SN. Since Send packets
are ordered, the responder increments its local copy of Rsp.
SN.
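
The FIG. 6 walk-through can be traced with a short simulation that keeps both local copies side by side. It models only the case where every packet arrives with a matching SEQ; the `Responder` class, the 8-bit width, and the packet counts per transaction (six, five, one) are assumptions picked to match the SN values given in the text.

```python
class Responder:
    """Hypothetical responder handling only in-sequence requests."""

    def __init__(self, sn_bits: int = 8) -> None:
        self.sn = 0                    # local copy, Rsp. SN
        self.modulus = 1 << sn_bits

    def receive(self, seq: int, ordered: bool) -> int:
        """Check SEQ, advance Rsp. SN if ordered, return the SEQ to echo."""
        if seq != self.sn:
            raise ValueError("only the matching-SEQ case is modelled here")
        if ordered:
            self.sn = (self.sn + 1) % self.modulus
        return seq


rsp = Responder()
rqst_sn = 0
trace = []
steps = [
    ("first two Sends", 6, True),        # packet counts are assumed
    ("RDMA", 5, False),
    ("next Send, first packet", 1, True),
]
for label, packets, ordered in steps:
    for _ in range(packets):
        rsp.receive(rqst_sn, ordered)
        if ordered:
            rqst_sn = (rqst_sn + 1) % rsp.modulus
    trace.append((label, rqst_sn, rsp.sn))
```

After each step the two copies agree: 6 and 6 after the Sends, still 6 and 6 after the unordered RDMA, and 7 and 7 once the first packet of the following Send has been processed.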



Sequence Numbers on Response Packets. 

When generating either a positive or negative 
acknowledgment, the responder logic copies the incoming 
sequence number and uses it in the sequence number field of 
the acknowledgment. 

The requestor logic matches incoming responses with the
originating request by comparing the SourceID, VI number,
Sequence number, transaction type, and Transaction Serial
Number (TSN) with that of the originating request. 
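
One way to picture this matching step is a lookup table keyed by the five compared fields. The sketch below is an assumption about data layout, not the patented logic: `ResponseKey`, `PendingRequests`, the field types, and the descriptor labels are all hypothetical, and the usage assumes that packets sharing a sequence number carry different TSN values.

```python
from typing import Dict, NamedTuple, Optional


class ResponseKey(NamedTuple):
    """The five fields compared between a response and its request."""
    source_id: int
    vi_number: int
    sequence_number: int
    transaction_type: str
    tsn: int


class PendingRequests:
    """Hypothetical table of requests still awaiting a response."""

    def __init__(self) -> None:
        self._pending: Dict[ResponseKey, str] = {}

    def sent(self, key: ResponseKey, descriptor: str) -> None:
        """Remember which descriptor a request packet belongs to."""
        self._pending[key] = descriptor

    def match(self, key: ResponseKey) -> Optional[str]:
        """Return and retire the originating request, or None if unknown."""
        return self._pending.pop(key, None)


pending = PendingRequests()
first = ResponseKey(3, 1, 6, "RDMA_WRITE", 10)
second = ResponseKey(3, 1, 6, "RDMA_WRITE", 11)   # same SEQ, different TSN
pending.sent(first, "descriptor-A")
pending.sent(second, "descriptor-B")
```

Because the key is the full tuple, responses may arrive in any order and still resolve to the correct request, and a response that matches nothing pending is simply reported as `None`.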

The invention has now been described with reference to
the preferred embodiments. Alternatives and substitutions
will now be apparent to persons of skill in the art. For
example, although the invention has been described in the
context of the ServerNet II SAN, the principles of the
invention are useful in any network that services both memory
semantics and channel semantics or uses multi-pathing.
Accordingly, it is not intended to limit the invention except as
provided by the appended claims.

What is claimed is: 

1. In a system area network (SAN) including multiple 
nodes coupled by a network fabric, a system for transferring 
data between a requestor node and a responder node, with 
the SAN implementing data transfers as a sequence of 
request/response packet pairs, a sub-system for managing 
ordered transactions requiring strict ordering of packet 
reception and remote direct memory access (RDMA) that 
allow out-of-order receipt of packets, said sub-system
comprising:

a first network interface card, coupled to said requestor 
node, including request logic for maintaining a request 
sequence number for each transaction and transaction 
logic for translating ordered transaction requests or 
RDMA transaction requests into sequences of request 
packets and for including a packet sequence number 
equal to the request sequence number in each packet, 
with each packet including a packet order field having 
either a first value specifying that packets must be 
received in order or having a second value specifying 
that packets can be received out-of-order, and where, 
for an ordered transaction, the transaction logic sets 
said packet ordering field in each packet sent to the first 
value and increments the request sequence number for 
each packet sent, and where, for an unordered 
transaction, transaction logic sets said packet ordering 
field in each packet sent to the second value and does 
not increment the request sequence number for each 
packet sent; and 

a second network interface card, coupled to said 
responder node and coupled to receive request packets 
sent from said first network interface card, said second 
NIC including response logic for generating and send- 
ing a response packet for each received request packet 
if the packet sequence number is equal to the response 
sequence number, with said response logic maintaining 
a response sequence number, and with said response 
logic copying the packet sequence number from the 
received request packet into a corresponding response 
packet, incrementing the local value of the sequence 
number subsequent to sending a response packet if the 
packet sequence number is equal to the response 
sequence number and the packet ordering field of the 
corresponding received request packet is equal to the 
first value, and for not incrementing the local value of 
the sequence number if the packet ordering field is 
equal to the second value. 

2. The sub-system of claim 1 wherein each network 
interface card has two ports, the first network interface card 
and the second network interface card being coupled by a 
plurality of paths between the ports with said packets having 
a packet order field equal to the first value transmitted only 
over a single path and packets having a packet order field 
equal to the second value transmitted over a plurality of 
paths. 

3. The sub-system of claim 1 wherein a plurality of
packets may be transmitted from the requestor prior to
receiving a response packet for any of the packets
transmitted.

4. The sub-system of claim 1 wherein the local copies of
the sequence numbers are initialized to a common value 
prior to data transfer between the requestor and responder 
nodes. 

5. In a system area network (SAN) including multiple
nodes coupled by a network fabric, a system for transferring
data between a requestor node and a responder node, with
the SAN implementing data transfers as a sequence of
request/response packet pairs between a requestor and a
responder, a method for managing ordered transactions
requiring strict ordering of packet reception and remote
direct memory access (RDMA) that allow out-of-order
receipt of packets, said method comprising the steps of:

at said requestor: 
maintaining a request sequence number for each
transaction;

translating ordered transaction requests or RDMA 
transaction requests into sequences of request pack- 
ets; 

including a packet sequence number equal to the
request sequence number in each packet, with each 
packet including a packet order field having either a 
first value specifying that packets must be received 
in order or having a second value specifying that 
packets can be received out-of-order, and, for an
ordered transaction;
setting said packet ordering field in each packet sent to
the first value; and
incrementing the request sequence number for each 
packet sent, 
and, for an unordered transaction; 

setting said packet ordering field in each packet sent to
the second value and
not incrementing the request sequence number for each 
packet sent; 
at said responder: 
generating and sending a response packet for each
received request packet;
maintaining a response sequence number; 
copying the packet sequence number from the received 
request packet into a corresponding response packet, 
incrementing the local value of the sequence number 
subsequent to sending a response packet if the packet 
sequence number is equal to the response sequence 
number and the packet ordering field of the corre- 
sponding received request packet is equal to the first 
value, and 

not incrementing the local value of the sequence num- 
ber if the packet ordering field is equal to the second 
value. 

* * * * *